gh-151497: Avoid huge pre-allocation for oversized tarfile extended headers #151498

Merged
encukou merged 2 commits into python:main from iamsharduld:gh-tarfile-extheader-memory
Jun 23, 2026

@iamsharduld

Contributor

tarfile reads a member's extended header (a GNU long name/link, or a pax
header) with a single read sized directly by the header's size field:

buf = tarfile.fileobj.read(self._block(self.size))

self.size is taken from the archive and is not validated, so a ~512-byte
crafted file can claim several gigabytes (or, via base-256 encoding, far more)
and make read() pre-allocate that much memory — on open/iterate
(tarfile.open(...).getmembers()), before any extraction filter runs. A
512-byte archive claiming 1 GiB drives a ~950 MiB resident allocation; a claim
of 1 TiB raises MemoryError even on high-RAM machines.
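To make the size claims above concrete, here is a minimal sketch of how a tar numeric field can be decoded and then rounded up to whole 512-byte blocks. The helper names `decode_size_field` and `round_up_to_block` are hypothetical and are not tarfile's own functions; the sketch assumes the usual 12-byte size field, which holds either octal ASCII text or a GNU base-256 value.

```python
def decode_size_field(field: bytes) -> int:
    """Decode a tar numeric header field (hypothetical helper).

    The field is either NUL/space-terminated octal text, or a GNU
    base-256 value flagged by a first byte of 0o200 (positive) or
    0o377 (negative).
    """
    if field[0] in (0o200, 0o377):
        # Base-256: the remaining bytes form a big-endian integer.
        n = int.from_bytes(field[1:], "big")
        if field[0] == 0o377:
            # Negative values are stored in two's-complement form.
            n -= 256 ** (len(field) - 1)
        return n
    text = field.split(b"\0", 1)[0].decode("ascii").strip()
    return int(text or "0", 8)


def round_up_to_block(size: int, blocksize: int = 512) -> int:
    """Round *size* up to a whole number of tar blocks."""
    return -(-size // blocksize) * blocksize


# Eleven octal digits can claim up to 8 GiB - 1; base-256 can claim far more.
one_gib = decode_size_field(b"10000000000\0")
one_tib = decode_size_field(b"\x80" + (1 << 40).to_bytes(11, "big"))
```

Nothing in either encoding ties the claimed size to the number of bytes actually present in the file, which is why the read has to be bounded separately.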

This reads the extended-header data in bounded chunks instead, so an oversized
or truncated header can no longer force a huge up-front allocation. The bytes
returned for valid archives are unchanged, and the change is safe for both
seekable and streaming (r|) tars.
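The bounded-chunk approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: the function name `read_bounded` and the constant `CHUNK` are hypothetical stand-ins.

```python
import io

CHUNK = 1024 * 1024  # hypothetical cap per read() call, 1 MiB


def read_bounded(fileobj, size, chunk=CHUNK):
    """Read up to *size* bytes without requesting more than *chunk* at once.

    Returns the same bytes fileobj.read(size) would, including a short
    result at end of file, while keeping each individual read bounded.
    """
    parts = []
    remaining = size
    while remaining > 0:
        data = fileobj.read(min(remaining, chunk))
        if not data:
            # End of file: the header claimed more data than exists.
            break
        parts.append(data)
        remaining -= len(data)
    return b"".join(parts)


# A 2560-byte stream whose header "claims" 1 TiB: only real bytes come back.
payload = bytes(range(256)) * 10
result = read_bounded(io.BytesIO(payload), 1 << 40, chunk=100)
```

Memory use then grows with the data actually present in the stream rather than with the claimed size, and the loop never seeks, so it works the same way on non-seekable streams.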

…nded headers

tarfile reads a member's extended header (a GNU long name/link or a pax
header) with a single read sized by the header's size field:

    buf = tarfile.fileobj.read(self._block(self.size))

The size is taken from the archive and is not validated, so a ~512-byte
crafted file can claim several gigabytes (or, via base-256 encoding, far
more) and make read() pre-allocate that much memory -- on open/iterate,
before any extraction filter runs.

Read the extended-header data in bounded chunks instead, so an oversized
or truncated header can no longer force a huge allocation. The bytes
returned for valid archives are unchanged.

@vstinner vstinner left a comment

Member

Comment thread Lib/test/test_tarfile.py Outdated
Comment thread Lib/test/test_tarfile.py Outdated
Comment thread Lib/test/test_tarfile.py Outdated
Comment thread Lib/tarfile.py
# bounded chunks to avoid a huge up-front allocation when a crafted or
# truncated archive claims far more data than the file actually contains
# (gh-151497).
_EXTHEADER_READ_CHUNK = 1024 * 1024 # 1 MiB

Member

I checked the _safe_read() argument when running test_tarfile. If I ignore the 4 GiB outlier, the size is between 512 bytes and 4 KiB. So a limit of 1 MiB sounds reasonable to me.

Member

I wouldn't expect the test suite to contain real-world data.
But, 1 MiB should do fine. It's well over io.DEFAULT_BUFFER_SIZE.

Comment thread Lib/tarfile.py Outdated
"""Read up to *size* bytes from *fileobj* in bounded chunks.

Returns the same bytes as ``fileobj.read(size)`` would (including a short
result at end of file), but never pre-allocates *size* bytes, so an

Member

Nitpick: it will preallocate size bytes if size is small.

Suggested change
- result at end of file), but never pre-allocates *size* bytes, so an
+ result at end of file), but limits pre-allocation, so an

…, assert against _EXTHEADER_READ_CHUNK, fix _safe_read docstring
@iamsharduld

Contributor Author

Thanks @vstinner and @encukou for the review. All addressed in 560b630:

  • Renamed _ReadSizeRecorder → ReadSizeRecorder and dropped the _ prefixes on crafted_archive() / check().
  • Decorated ExtendedHeaderMemoryTest with @support.cpython_only and now assert against the private tarfile._EXTHEADER_READ_CHUNK instead of the magic 10 MiB (using assertLessEqual, since a single read of exactly the chunk size is expected).
  • Reworded the _safe_read docstring: it does pre-allocate for a small size, it just bounds the pre-allocation.

Kept the 1 MiB chunk limit as discussed. PTAL when you get a chance.
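
The recorder-based check mentioned above might look roughly like this. It is a self-contained sketch, not the test added in this PR: `ReadSizeRecorder` here is a guess at the idea behind the class of that name, and `CHUNK` and `chunked_read` are hypothetical stand-ins for tarfile._EXTHEADER_READ_CHUNK and the patched read path.

```python
import io

CHUNK = 1024 * 1024  # stand-in for tarfile._EXTHEADER_READ_CHUNK (assumed 1 MiB)


class ReadSizeRecorder(io.BytesIO):
    """BytesIO that records the size argument of every read() call."""

    def __init__(self, data=b""):
        super().__init__(data)
        self.sizes = []

    def read(self, size=-1):
        self.sizes.append(size)
        return super().read(size)


def chunked_read(fileobj, size, chunk=CHUNK):
    """Hypothetical bounded read: never request more than *chunk* at once."""
    parts = []
    while size > 0:
        data = fileobj.read(min(size, chunk))
        if not data:
            break  # short file: stop instead of waiting for claimed bytes
        parts.append(data)
        size -= len(data)
    return b"".join(parts)


# The stream holds 512 bytes, but the caller claims 1 GiB.
recorder = ReadSizeRecorder(b"x" * 512)
data = chunked_read(recorder, 1 << 30)
largest_request = max(recorder.sizes)
```

Comparing with "less than or equal" rather than "less than" matters because a request of exactly the chunk size is the expected behavior for an oversized claim.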

Comment thread Lib/tarfile.py
@encukou encukou merged commit da99711 into python:main Jun 23, 2026
54 checks passed
@encukou

encukou commented Jun 23, 2026

Member

Thank you!

@encukou encukou added the needs backport to 3.13 (bugs and security fixes), needs backport to 3.14 (bugs and security fixes), and needs backport to 3.15 (pre-release feature fixes, bugs and security fixes) labels Jun 23, 2026
@miss-islington-app

Thanks @iamsharduld for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.15.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

@miss-islington-app

Thanks @iamsharduld for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13.
🐍🍒⛏🤖

@miss-islington-app

Thanks @iamsharduld for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.14.
🐍🍒⛏🤖

@bedevere-app

bedevere-app Bot commented Jun 23, 2026

GH-151977 is a backport of this pull request to the 3.15 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.15 pre-release feature fixes, bugs and security fixes label Jun 23, 2026
@bedevere-app

bedevere-app Bot commented Jun 23, 2026

GH-151978 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.13 bugs and security fixes label Jun 23, 2026
@bedevere-app

bedevere-app Bot commented Jun 23, 2026

GH-151979 is a backport of this pull request to the 3.14 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.14 bugs and security fixes label Jun 23, 2026
4 participants